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Abstract 

Background: Reverse docking approaches have been explored in previous studies on drug discovery to overcome 
some problems in traditional virtual screening. However, current reverse docking approaches are problematic in 
that the target spaces of those studies were rather small, and their applications were limited to identifying new 
drug targets. In this study, we expanded the scope of target space to a set of all protein structures currently 
available and developed several new applications of reverse docking method. 

Results: We generated 2D Matrix of docking scores among all the possible protein structures in yeast and human 
and 35 famous drugs. By clustering the docking profile data and then comparing them with fingerprint-based 
clustering of drugs, we first showed that our data contained accurate information on their chemical properties. 
Next, we showed that our method could be used to predict the druggability of target proteins. We also showed 
that a combination of sequence similarity and docking profile similarity could predict the enzyme EC numbers 
more accurately than sequence similarity alone. In two case studies, 5-flurouracil and cycloheximide, we showed 
that our method can successfully find identifying target proteins. 

Conclusions: By using a large number of protein structures, we improved the sensitivity of reverse docking and 
showed that using as many protein structure as possible was important in finding real binding targets. 



Background 

Identifying disease genes and target proteins of drugs is a 
critical step in drug discovery. Once the disease genes are 
identified, designing lead compounds which can modu- 
late those genes or the protein products may lead to a 
successful new drug. The growth of the number of avail- 
able 3D structures of proteins and computing power has 
enabled high-throughput computational screening of 
lead compounds, which is known as virtual screening. 
Conventionally, these virtual screening methods have 
focused on searching chemical space for chemicals that 
can specifically bind to a protein target [1]. 

Complication in this structure-based drug discovery 
strategy is that there may exist unknown off-target 
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proteins that can bind to the lead compounds unexpect- 
edly, which undoubtedly poses some difficulty such as 
severe side effect, but also provides a new opportunity. 
Upon discovering novel drug targets for existing drugs, 
we can expand indications of the drugs by drug reposi- 
tioning. Motivated by this, reverse (or inverse) docking 
approaches have received increasing interest to find 
unknown targets of natural products and existing old 
drugs [2-4]. In reverse docking, one tries to find the 
protein targets which can bind to a particular ligand. 

In previous researches, based on an assumption that 
the number of predicted potential protein targets [5] is 
quite low compared to the number of genes, they tried to 
find new drug targets among a relatively small number of 
potential target proteins. For example, a reverse docking 
study by Gao et al. used -1,100 targets [6], and that by 
Hui-fang et al., used 1,714 targets and 8 compounds [7]. 
However, this may cause poor coverage of the protein 
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structure space in reverse docking. Moreover, their only 
intended application of their reverse docking methods is 
to find the targets of drugs. On the other hand, various 
approaches including statistical method using sequence 
and structure similarity [8], calculating binding site simi- 
larity [9,10], and prediction of druggability by descriptors 
[11] have been developed. 

Here, we present a large-scale reverse docking study. 
The main difference from previous studies is that we used 
all available protein structures in human and yeast. To our 
best knowledge, our docking profile contains the largest 
number of protein structures. The reverse docking profile 
was merged into a matrix which can be easily interpreta- 
ble. We showed the some properties of the large-scale 
docking profile and demonstrated usefulness of these 
docking profile data. We also developed several new appli- 
cations such as predicting druggability of protein targets 
and protein function prediction based on docking profile 
similarity. We discussed two interesting case studies, 
5-flurouracil and cycloheximide. Especially, we successfully 
demonstrated that using as many protein structures as 
possible was important in improving the sensitivity of 
reverse docking and finding real binding targets. 

Results and discussion 

Data structure 

All docking scores of ligand-biding site pairs were 
merged into a matrix form. In total, yeast profiles and 
human profiles are composed of 1,165 and 10,886 bind- 
ing sites, respectively with thirty-five ligands (Table 1). 
The numeric data are available in Additional files 1, 2. 
The resulting docking profiles were hierarchically clus- 
tered in both directions for the further analysis. The 
clustering results are shown in Figure 1. 



Table 1 The list of ligands used to generate reverse 
docking profiles 



Index 


Ligand 


PubChem CID 


1 


5-FC 


3366 


2 


5-FU 


3385 


3 


Brefeldin A 


5287620 


4 


Camptothecin 


24360 


5 


Chlorpromazine 


2726 


6 


Cimetidine 


2756 


7 


Clotrimazole 


2812 


8 


Cycloheximide 


6197 


9 


Dipyridamole 


3108 


10 


Doxorubicin 


31703 


11 


Dyclonine 


3180 


12 


Fluvastatin 


446155 


13 


Gemfibrozil 


3463 


14 


Haloperidol 


3559 


15 


Hydrocortisone 


5754 



Table 1 The list of ligands used to generate reverse 
docking profiles (Continued) 
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Pravastatin 
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54687 


26 


Procaine 


1 548986 


27 


Progesterone 


5994 


28 


Radicicol 


6323491 


29 


Sulfamethoxazole 


5329 


30 


Sulfinpyrazone 


5342 


31 


Tamoxifen 


2733526 


32 


Terbinafine 


1 549008 


33 


Theophylline 


2153 


34 


Tunicamycin 


6433557 


35 


Valproic acid 


3121 



The dendrograms along the column (chemical space) in 
both species are very similar to each other. We compared 
these dendrograms with two dendrograms from Pub- 
Chem Structure Clustering [12] based on a measure 
recently developed for comparing two hierarchical clus- 
tering [13]. PubChem provides two kinds of clustering 
based on 2D structure fingerprint and 3D shape/feature 
similarity. Our two dendrograms are more similar with 
the clustering based on 3D similarity than that based on 
2D similarity (Table 2). The reason may be that 3D con- 
formations of ligands are more relevant for the protein- 
ligand docking fitness than 2D information. We also sug- 
gest that the topology of ligand clusters can be used as a 
new similarity measure between two small molecules. 
Nevertheless, relatively high similarity scores indicate that 
our docking profile data convey accurate information on 
their chemical properties. 

Druggablity analysis 

The "druggability" of a certain target protein represents 
how probable the protein is in fact a real target of 
drugs, and it has been investigated in many previous 
studies [14-16]. In one such method, the druggability of 
a protein was inferred from its homologous proteins 
whose druggabilities were already known [17]. The 
weakness of this method is that the number of targets 
with known druggability is limited. Other approaches 
attempted to define "draggable" as "highly likely to bind 
to putative drugs", i.e., "bindability" [18,19]. 
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Figure 1 Color mapped docking fitness profile matrix in (a) yeast and (b) human The elements represent corresponding docking fitness. 
The numbers in horizontal axis corresponds to ligands in Table 1. Each row presents a protein binding site. There are (a) 1,165 and (b) 10,886 
rows. The dendrograms in both axes were constructed by hierarchical clustering. 



In the context of bindability, the docking profiles in 
this study can provide good large scale simulated data. 
We first checked whether our data were in accord with 
predefined druggability dataset [20]. The non-redundant 



set of druggable and less druggable (NRDLD) set con- 
tains 71 druggable and 44 less-druggable targets. Since 
not all the entries are human proteins, the numbers of 
overlapped targets are 43 druggable and 8 less-druggable. 
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Table 2 The similarities among hierarchical clusterings in 
ligand space. 





Human 


Yeast 


PC 2D 


PC 3D 


Human 


1 


0.939 


0.513 


0.708 


Yeast 


0.939 


1 


0.504 


0.712 


PC 2D 


0.513 


0.504 


1 


0.565 


PC 3D 


0.708 


0.712 


0.565 


1 



Rows and columns "Human" and "Yeast" represent clustering based on 
docking profile similarity in human and yeast, respectively. "PC 2D" and 
"PC 3D" represents clustering from PubChem based on 2D structure 
fingerprint and 3D conformation similarity. 



fitness scores of those ligands (see Methods), we predicted 
539 putative druggable binding sites and 289 less- 
druggable binding sites (Additional file 3). The predicted 
druggable and less-druggable binding sites were classified 
into 6 enzyme classes according to the first digit of their 
EC numbers [21] (Figure 3). Oxidoreductases occur more 
frequently in the druggable. Hydrolase, lyases and iso- 
merases occur more frequently in the less-druggable set. 
Except for the case of ligase in which all relative frequen- 
cies are less than 0.1, the enzyme class distribution trends 
are similar in both the predicted set and NRDLD set. 



Figure 2 shows the average docking fitness values of the 
sets of those overlapped targets, along with the averages 
of 10,886 binding sites' docking scores in human data set. 
Except for the case of cimetidine (Ligand 6), all averages 
of our docking scores of druggable set are greater than 
those of less-druggable set. One can observe that overall 
average values are placed between druggable and less- 
druggable sets in nearly all cases. This result suggests 
that without serious training it may be possible to classify 
all protein targets whose docking profiles are available 
into druggable or less-druggable target. 

Next, by simply comparing docking fitness scores of 
each protein target against 35 ligands with the average 



Protein function prediction based on docking profile 
similarity 

An advantage of the present study is that docking profiles 
are available for both human and yeast proteomes. Yeast 
is the best characterized eukaryotic model organism. 
A variety of related resources such as chemical genomic 
profile [22], whole genome knock-out library [23], pro- 
tein-protein interaction and genetic interaction data [24] 
are available. We expect that docking scores generated in 
this study can be combined with these resources to infer 
novel protein targets. In addition, it is valuable to check 
whether orthologous protein pairs of human and yeast 
share similar docking profiles. In Figure 4, we plot two 
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Figure 2 Average docking scores of NRDLD druggable and less druggable set and those of overall docking fitness The numbers in 
horizontal axis represents the corresponding ligands in Table 1. The subplot represents difference between NRDLD druggable and less 
druggable (druggable minus less druggable). The values in subplot always show positive values except for small negative values (-0.13) in Ligand 
6 (cimetidine). 
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Figure 3 Distributions of the enzyme classes. Distributions of the enzyme classes in the NR Putative druggable and less druggable set 
(assigned in this study; red bars), and NRDLD druggable and less druggable set (green bars). Note that the enzyme class distribution trends are 
similar in both sets. 
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distributions of Euclidean distances of docking scores of 
both species; one for the orthologous pairs and the other 
for all human-yeast pairs. It is observed that sequence 
similarity is generally reflected in the similarity of dock- 
ing profile across the two species. 

The results shown in Figure 4 suggest that we can uti- 
lize our docking method to infer the function of proteins, 
especially the proteins that have no apparent orthologs 
with known function. To show that docking profiles con- 
tain the information which can be utilized to predict the 
function of proteins, we carried out a large-scale function 
prediction of enzymes. For 3,874,883 pairs of 5,989 
human proteins and 647 yeast proteins, we collected all 
pairs for which EC numbers were available. We used 
BLAST e-value [25] as the sequence similarity measure, 
and Euclidean distance between the two docking profiles 
as the docking profile similarity measure. The perfor- 
mances are shown in Figure 5 as receiver operating char- 
acteristic (ROC) curve. In low false positive rate (FDR) 
region, using e-value yielded better performance, while 
docking profile similarity scores performed better in high 
FDR region. This limitation is due to substantial overlap 
between distance distributions of positive and negative 
pairs as shown in Figure 4. Although the average and 
median of distances of positive pairs are quite less than 



those of negative pairs, any single non-parametric Eucli- 
dean distance value cannot divide two groups perfectly. 
Relationship between docking fitness distance and 
enzyme function is more complex than a single thresh- 
old. However, it is observed that docking profile similar- 
ity contain positive information for function prediction 
which is not overlapped with the sequence information. 
Thus, this information would be used as a useful feature 
with combination of other features such as sequence 
similarity, structural similarity, and binding site 
similarity. 

Here, for example, simple implementation of combina- 
tion of sequence and docking profile information was 
tested. To cover low sensitivity of docking fitness in low 
FDR, a new distance was defined as follows: if BLAST 
e-value of a pair is less than le-5, e-value is used as the 
distance; if otherwise, Euclidean distance is used. The 
performance of this metric is shown in Figure 5 (red). 
Note that this simple metric is never based on any ser- 
ious training, feature extraction, or machine learning 
technique. Not considering which elements in 35- 
dimentional docking profile are important, and simply 
adding information of docking profile exhibits better 
performances in all area. In summary, this implies that 
using docking profile information together with other 




Sequence similarity 

Docking profile similarity 

Both 



0.4 0.6 0.8 1 

False positive rate (1 -specificity) 

Figure 5 ROC curve for assigning EC numbers (up to fourth digit) of proteins using those of nearest proteins in the other species 

Distances are based on BLAST e-value (sequence similarity), Euclidean distance of docking profiles (docking profile similarity), and hybrid of both. 
False positive rate is FP/(FP+TN), and true positive rate is TP/(TP+FN) where FP, TN, TP, FN are the numbers of false positives, true negatives, true 
positives, and false negatives, respectively. 
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useful measures as features of state-of-the-art machine 
learning technique and increasing the size of docking 
profile, i.e., appending the reverse docking results of 
additional ligands would get close to more precise func- 
tion prediction of proteins. 

Case studies 

The docking profile data generated in this study can be 
applied in a variety of ways. As discussed in the previous 
section, it can be utilized to infer protein function. On 
the other hand, more common application that has been 
explored in several previous studies is to infer new bind- 
ing targets for known drugs. Here, we present two case 
studies. 

Binding target of 5-FC and 5-FU 

5-fluorocytosine (5-FC) and 5-flurouracil (5-FU) are 
both fluorinated analogues of pyrimidine [26]. The 
structures of the two ligands are quite similar. There- 
fore, not surprisingly, the docking profiles are quite 
similar as well. Moreover, the top-ranking binding site 
of both ligands is the structure of yeast exosome com- 
ponent, the protein product of gene rrp6 (PDB id: 
2hbm) [27]. The structure was identified relatively 
recently, so 2hbm has never been annotated as putative 
target, not to mention druggable. Previously known 
mechanism of action of 5-FU is inhibition of thymidy- 
late synthetase [28]. Thus, the top-ranking structure, 
2hbm, might be considered as a false positive. Fortu- 
nately, however, genome-wide study using tagged het- 
erozygotes yeast mutants provided a strong evidence 
that rrp6 related rRNA processing exosome is a target 
of 5-FU [29]. The direct binding target of 5-FU was not 
identified in the previous study, but the result of that 
research and the docking scores strongly suggest that 
the protein product of rrp6 is the direct binding target 
of 5-FU in yeast. 

Protein structures from the same sequence 

Similarly to the case of 5-FU, we also investigated the 
high-ranking targets in docking profile of cycloheximide 
(CHX). The top-ranking structure is the PDB structure 
lql7 which is the protein structure of yeast gene Hst2, 
homologous to eukaryotic SIR2 [30]. Interestingly, 
among three protein products (lql7, lql4, lqla) of hst2 
whose structures were identified by the same researchers 
[31,32], lql7 and lqla exhibited high binding fitness (1st 
and 6th ranked) while lql4 showed poor binding affinity. 
We tried to find what caused these differences. It is 
known that lql7 and lqla lack the 64 residue C- terminal 
tails of hst2 sequence in common while lql4 is the struc- 
ture of intact Hst2 (Figure 6a). We also found that there 
was the comparison study between yeast mutant strains, 
which lack corresponding C-terminal tail regions and 



wild type treating CHX [33] (Figure 6b). In that study, 
expression of HST2-298A which corresponds to PDB 
lql7 or lqla led to increased sensitivity to CHX. This 
phenotype is surprisingly well characterized in docking 
profile in our study as the top-ranked docking fitness 
value. 

Are such a large number of protein structures necessary 
for reverse docking? 

Compared to the protein sets used in previous studies, 
the set used in this study is quite large and has some 
redundancy. One may question whether all these struc- 
tures contribute to the sensitivity of reverse docking. It 
is an important issue because docking still costs high 
computing power and is time-consuming. 

In our dataset of human, 8,717 structures out of 10,886 
structures have the hits sharing the same UniProt ID 
with 1,339 unique UniProt IDs. In other words, those 
8,717 structures could be reduced into 1,339 structures 
by removing at most 7,378 structures if we filter the set 
with respect to only sequence redundancy. However, 
there are many cases where docking fitness profiles for 
similar sequences are quite different. 

To show this property, we first carried out hierarchical 
clustering of docking profile of proteins. For each sub- 
cluster, if all the members were derivatives of the same 
UniProt ID, the members were merged into one. This 
procedure was repeated until there were no sub-clusters 
in which members shared the same UniProt ID. As a 
result, 1,710 structures were filtered out eventually, i.e., 
only about 20% of sequence-redundant protein structures 
exhibited the redundancy in docking profile. This is due 
to heterogeneity in PDB. There are many modified struc- 
tures such as oxidized, reduced, multimeric, metal con- 
taining, and truncated forms for even a one protein 
sequence. Thus, we concluded that the sets of protein 
structures which were used in previous reverse docking 
studies are insufficient. For example, the interesting 
results from the docking of cycloheximide, which was 
discussed in the previous section, would have not been 
obtained. 

Another interesting example is the main binding target 
of hydrocortisone, the glucocorticoid receptor (GCR). 
There are nine structures of the GCR in PDB. However, 
datasets used for reverse docking such as potential drug 
target database (PDTD) [6] included only two of them 
(PDB lnhz and lp93). The result of reverse docking of 
hydrocortisone by others [7] using PDTD could not 
detect the GCR as the target. In our docking profile, PDB 
3bqd was the top-ranking protein target, which is 
another structure of the GCR. If we had removed redun- 
dancy based on sequence similarity, we could have not 
detected the real target of the GCR. Therefore, our 
reverse docking experiment suggests that using as many 
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Figure 6 Sequence variation among derivatives of yeast gene Hst2. (a) C-termial sequences of intact Hst2 (uniprot ID: P53686) and those of 
its three structures. 1q17 and 1 q 1 a are the structure of the 64 residue C-terminal deletion construct, (b) HST2-298A mutant strain only contains 
amino acids 1-298, with the deletion of C-terminal end. Regenerated from the previous work by others [33]. 



as possible protein structures in reverse docking is 
worthwhile in finding unknown drug targets or unex- 
pected mode-of-action even though it costs high compu- 
tation cost. 

Conclusions 

In this study, we generated large-scale reverse docking 
profiles for all X-ray protein structures in human and 
yeast. These data can be the reference for future binding 
assays and used to find unexpected binding targets of 
drugs. Furthermore, it would be useful to find unknown 
therapeutic uses in drug repositioning. In some case stu- 
dies, targets not annotated as druggable or not stored in 
target database previously exhibit high binding fitness 
and they are highly likely to be real binding targets con- 
sidering previous functional experiments. By using a 
large number of protein structures, we improved the 
sensitivity of reverse docking and showed that using as 
many protein structure as possible was important in 
finding real binding targets. Although we used as small 
as 35 ligands in docking, we were able to demonstrate 
some usefulness of our data. Generating this kind of 
reverse docking profile of a large number of ligands 
would be valuable in the future study. 

Methods 

Data preparation 

All available X-ray protein structures in human and 
budding yeast Saccharomyces cerevisiae were retrieved 
from RCSB Protein Data Bank (PDB) [34,35]. The best 
putative binding sites of each PDB structure were gener- 
ated by using the program Fpocket [36,37]. To make 
pockets appropriate inputs for the docking, Open Babel 
[38] was used to protonate all the pockets. Thirty-five 
well-known ligands (Table 1) were manually selected 
from previous high-throughput experimental studies 
[29,39] to perform high-throughput reverse docking 



after excluding some ligands that were too large or 
small for molecular docking study. The 3D structures of 
the ligands were retrieved from PubChem [40] and con- 
verted from sdf file [41] into Tripos mol2 file format. 

Docking 

All the protonated pockets were docked against the 
ligand set using GOLD [42]. We used a 'flexible ligand- 
rigid protein' mode. All other options involved in 
GOLD's search algorithm and termination factor were 
set to the default options. Given several putative docking 
conformations, we only chose the highest-ranking bind- 
ing pose for each ligand-biding site pair. The GOLD fit- 
ness value [43] was used as a measure of the binding 
fitness. As a result, 10,886 x 35 matrix and 1,165 x 35 
matrix of docking fitness scores for human and yeast, 
respectively, were made and used in this study (Addi- 
tional files 1, 2). 

Druggability analysis 

Predefined the non-redundant set of druggable and less 
druggable binding sites (NRDLD set) was retrieved from 
the study by Krasowski et al. [20]. Among 71 druggable 
binding sites and 44 less-druggable ones in NRDLD set, 
43 druggable and 8 less-druggable binding sites are 
overlapped with human protein structures used in this 
study. These 51 binding sites were used for druggability 
analysis. 

Putative druggable and less-druggable protein binding 
sites were assigned by the following rules: a binding site 
is druggable when all 35 docking values of the binding 
site are always larger than corresponding overall average 
values, and less-druggable when all 35 docking values of 
a binding site are always less than corresponding aver- 
age values. 

To get EC number composition of assigned druggable 
and less-druggable sets, non-redundant (NR) putative 
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druggable and less druggable sets were defined. In this 
study, NR set means that the set do not contain any 
pairs of proteins sharing the same UniProt ID [44]. 
Note that we did not use any sequence identity measure 
to remove redundancy. 

Ortholog mapping 

Although there are several ortholog databases, none of 
those provides PDB-based mapping table. Therefore, we 
obtained the ortholog mapping between human and 
yeast protein structures by the following procedure. 
First, we retrieved human-yeast ortholog table from 
InParanoid [45,46]. In this table, human proteins and 
yeast proteins were annotated by Ensembl's id (ENSP) 
[47] and yeast ORF name [48], respectively. These terms 
were transferred into PDB id by PICR [49] to complete 
PDB-based mapping. 

Additional material 



Additional file 1: Docking profiles for human protein structures A 

csv document containing all docking scores of human protein structures 
with ligands used in this study. This file can be viewed with Microsoft 
Excel or any text editor. 

Additional file 2: Docking profiles for yeast protein structures A csv 

document containing all docking scores of yeast protein structures with 
ligands used in this study. This file can be viewed with Microsoft Excel or 
any text editor. 

Additional file 3: The list of putative drugggable and less 
druggable targets based on docking profile The list of putative 
drugggable and less druggable targets based on docking profile. The file 
contains PDB ID, corresponding EC number, and assigned druggability. 
This file can be viewed with Microsoft Excel or any text editor. 
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